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ABSTRACT 

Scientific  and  engineering  applications  often  involve  structured  meshes.  These  meshes  may  be  nested  (for 
multigrid  codes)  and/or  irregularly  coupled  (called  multiblock  or  irregularly  coupled  regular  mesh  problems). 
In  this  paper,  we  present  a  combined  runtime  and  compile-time  approach  for  parallelizing  these  applications 
on  distributed  memory  parallel  machines  in  an  efficient  and  machine-independent  fasliion.  We  have  designed 
and  implemented  a  runtime  library  which  can  be  used  »o  port  these  applications  on  distributed  memory 
machines.  The  library  is  currently  implemented  on  several  different  systems.  To  further  ease  the  task  of 
application  programmers,  we  have  developed  methods  for  integrating  this  runtime  library  with  compilers  for 
HPF-like  parallel  programming  languages.  We  discuss  how  we  have  integrated  this  runtime  library  with  the 
Fortran  90D  compiler  being  developed  at  Syracuse  University.  We  present  experimental  results  to  demonstrate 
the  efficacy  of  our  approach.  We  have  experimented  with  a  multiblock  Navier-Stokes  solver  template  and  a 
multigrid  code.  Our  experimental  results  show  that  our  primitives  have  low  runtime  communication  overheads. 
Further,  the  compiler  parallelized  codes  perform  within  20%  of  the  code  parallelized  by  manually  inserting 
calls  to  the  runtime  library. 
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1  Introduction 


In  recent  years,  distributed  memory  parallel  machines  have  been  widely  recognized  as  the  most  likely  means 
of  achieving  scalable  high  performance  computing.  Bowever,  there  are  two  major  reasons  for  their  lack  of 
popularity  among  developers  of  scientific  and  engineering  applications.  First,  it  is  very  difficult  to  pvallelize 
application  programs  on  these  machines.  Second,  it  is  not  easy  to  get  good  speed-ups  and  efficiency  on  commu¬ 
nication  intensive  applications.  Current  distributed  memory  machines  have  good  communication  bandwidths, 
but  they  also  have  high  startup  latencies  which  often  result  in  high  communication  overheads. 

Recently  there  have  been  major  efforts  in  developing  programming  language  and  compiler  support  for 
distributed  memory  machines.  Based  on  the  initial  works  of  projects  like  Fortran  D  [13,  22]  and  Vienna  For¬ 
tran  [6,41]  ,  the  High  Performance  Fortran  Forum  has  recently  proposed  the  first  version  of  High  Performance 
Fortran  (HPF)  [12].  HPF  allows  programmer  to  specify  the  layout  of  distributed  data  and  specify  parallelism 
through  operations  on  array  sections  and  through  parallel  loops.  Proposed  HPF  compilers  are  being  designed 
to  produce  Single  Program  Multiple  Data  (SPMD)  Fortran  77  (F77)  code  with  message  passing  and/or  runtime 
communication  primitives.  HPF  offers  the  promise  of  significantly  easing  the  task  of  programming  distributed 
memory  machines  and  making  programs  independent  of  a  single  machine  architecture. 

Reducing  communication  costs  is  crucial  in  achieving  good  performance  on  applications  [20,  21].  While  cur¬ 
rent  systems  like  the  Fortran  D  project.  [22]  and  the  Vienna  Fortran  Compilation  system  [6]  have  implemented 
a  number  of  optimizations  for  reducing  communication  costs  (like  message  blocking,  collective  communication, 
message  coalescing  and  aggregation),  these  optimizations  have  been  developed  only  in  the  context  of  regular 
problems  (i.e.  in  code  having  only  regular  data  access  patterns).  Special  effort  is  required  in  developing 
compiler  and  runtime  support  for  applications  that  do  not  necessarily  have  regular  data  access  patterns.  Our 
group  has  already  developed  compiler  embedded  runtime  support  for  completely  irregular  computations  (i.e. 
codes  in  which  distributed  arrays  are  accessed  based  on  indirection  arrays)  [10,  11,  18]. 

One  class  of  scientific  and  engineering  applications  involves  structured  meshes.  These  meshes  may  be 
nested  (as  in  multigrid  codes)  or  may  be  irregularly  coupled  (called  Multiblock  or  Irregularly  Coupled  Regu¬ 
lar  Mesh  Problems)  [9].  Multigrid  is  a  common  technique  for  accelerating  the  solution  of  partial-differential 
equations  [5,  30].  Multigrid  codes  employ  a  number  of  meshes  at  different  levels  of  resolution.  The  restriction 
and  prolongation  operations  for  shifting  between  different  multigrid  levels  require  moving  regular  array  sec¬ 
tions  [19]  with  non-unit  strides.  In  multiblock  problems,  the  data  is  divided  into  several  interacting  regions 
(called  blocks  or  subdomains).  There  are  computational  phases  in  which  regular  computation  is  performed 
on  each  block  independently.  Boundary  updates  require  communication  between  blocks,  which  is  restricted 
to  moving  regular  array  sections  (possibly  including  non-unit  strides). 

Multiblock  grids  are  frequently  used  for  modeling  geometrically  complex  objects  which  cannot  be  easily 
modeled  using  a  single  regular  mesh  [2,  3,  24,  29,  36].  In  Figure  1,  we  show  how  the  area  around  an 
aircraft  wing  has  been  modeled  with  a  multiblock  grid.  Multiblock  applications  are  used  in  important  grand- 
challenge  applications  like  air  quality  modeling  [28,  32],  computational  fluid  dynamics  [40],  structure  and 
galaxy  formation  [25,  38],  simulation  of  high  performance  aircrafts  [1,  8,  31],  large  scale  climate  modeling  [14], 
reservoir  modeling  for  porous  media  [14],  simulation  of  propulsion  systems  [14],  computational  combustion 
dynamics  [14],  geophysical  databases  [33],  and  land  cover  dynamics  [23]. 

In  this  paper  we  present  a  combined  runtime  and  compile-time  approach  for  parallelizing  this  general 
class  of  applications  on  distributed  memory  machines.  We  present  runtime  support  that  we  have  designed  and 
implemented  for  parallelizing  these  applications  on  distributed  memory  machines  in  an  efficient,  convenient  and 
machine  independent  manner.  We  state  the  extensions  required  to  the  current  version  of  HPF  for  handling 


block  structured  codes.  We  present  methods  by  which  the  compilers  for  HPF  style  parallel  programming 
languages  can  automatically  generate  calls  to  our  runtime  primitives.  We  discuss  how  we  have  integrated  our  *  ■.  — 
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Figure  1:  Block  structured  grid  around  a  wing,  showing  an  interface  between  blocks 

runtime  primitives  with  the  Fortran  90D  compiler  being  developed  at  Syracuse  University  [4].  While  the 
design  of  our  runtime  system  was  initially  motivated  by  multigrid  and  multiblock  applications,  our  primitives 
can  also  be  used  in  many  cases  for  parallelizing  computations  with  regular  data  access  patterns. 

We  present  experimental  results  to  demonstrate  the  efficacy  of  our  approach.  We  have  experimented  with 
one  maltiblock  application  [40]  and  one  multigrid  code  [35].  We  have  measured  the  runtime  overheads  of  our 
primitives.  We  have  compared  the  performance  of  compiler  parallelized  multiblock  and  multigrid  templates 
with  those  of  the  hand  parallelized  (i.e.  parallelized  by  inserting  calls  to  the  runtime  library  by  hand)  versions. 
Our  experimental  results  show  that  the  primitives  have  low  runtime  communication  overheads  and  the  compiler 
parallelized  codes  performs  within  20%  of  the  codes  parallelized  by  inserting  calls  to  runtime  library  by  hand. 
We  discuss  the  optimizations  that  we  have  used  to  achieve  this  performance. 

Several  other  researchers  have  also  developed  runtime  libraries  or  programming  environments  for  multiblock 
applications.  Baden  [24]  has  developed  a  Lattice  Programming  Model  (LPAR).  This  system,  however,  achieves 
only  coarse  grained  parallelism  since  a  single  block  can  only  be  assigned  to  one  processor.  Quinlan  [26]  has 
developed  P++,  a  set  of  C++  libraries  for  grid  applications.  While  this  library  provides  a  convenient  interface, 
the  libraries  do  not  optimize  communication  overheads.  Our  library,  on  the  contr^tst,  reduces  communication 
costs  by  using  message  aggregation. 

The  rest  of  this  paper  is  organized  as  follows.  In  Section  2,  we  introduce  the  runtime  library  that  we  have 
developed.  In  Section  3,  we  discuss  the  regular  section  analysis  required  for  efficiently  generating  schedules 
for  regular  section  moves,  one  of  the  communication  primitives  in  our  library.  In  Section  4,  we  state  the 
extensions  required  to  the  current  version  of  HPF  and  discuss  how  the  compiler  recognizes  the  data  access 
patterns  which  can  be  handled  using  the  runtime  primitives  that  we  have  developed.  In  this  section  we  also 
discuss  the  compiler  transformations  for  automatically  inserting  the  calls  to  the  runtime  libraiy  routines.  In 
Section  5,  we  present  experimental  results  to  study  the  communication  overheads  of  our  primitives  and  the 
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effectiveness  of  the  compiler.  We  conclude  in  Section  6. 

2  Runtime  Support 

In  this  section  we  present  the  details  of  the  runtime  support  library  that  we  have  designed  for  parallelizing 
multiblock  and  multigrid  codes  on  distributed  memory  machines.  We  discuss  the  nature  of  computation 
and  communication  in  multiblock  and  multigrid  codes  and  also  describe  how  the  library  primitives  facilitate 
parallelization  of  these  applications. 

The  set  of  runtime  routines  that  we  have  developed  is  called  the  Multiblock  Parti  library  [39].  In  summary, 
these  primitives  allow  an  application  programmer  or  a  compiler  to 

•  Lay  out  distributed  data  in  a  flexible  way,  to  enable  good  load  balancing  and  minimize  interprocessor 
communication, 

•  Give  high  level  specifications  for  performing  data  movement,  and 

•  Distribute  the  computation  across  the  processors. 

We  have  designed  the  primitives  so  that  communication  overheads  are  significantly  reduced  (by  using 
message  aggregation).  These  primitives  provide  a  machine-independent  interface  to  the  compiler  writer  and 
applications  programmer.  We  view  these  primitives  as  forming  a  portion  of  a  portable,  compiler  independent, 
runtime  support  library. 

This  library  is  currently  implemented  on  the  Intel  iPSC/860,  the  Thinking  Machines’  CM-5  and  the  PVM 
message  passing  environment  for  network  of  workstations  [15].  The  design  of  the  library  is  architecture  inde¬ 
pendent  and  therefore  it  can  be  easily  ported  on  any  distributed  memory  parallel  machine  or  any  environment 
which  supports  message  passing  (e.g.  Express).  The  library  primitives  can  currently  be  invoked  from  Fortran 
or  C  programs.  Programmers  can  port  their  Fortran  or  C  programs  on  distributed  memory  machines  by 
manually  inserting  calls  to  the  library  routines.  The  resulting  program  has  Single  Program  Multiple  Data 
(SPMD)  model  of  parallelism. 

While  the  design  of  our  runtime  system  was  initially  motivated  by  multigrid  and  multiblock  applications, 
our  runtime  primitives  are  also  applicable  in  many  cases  for  regular  codes.  In  Section  4,  we  briefly  mention 
other  cases  when  our  primitives  can  be  used.  In  this  section,  however,  we  discuss  the  details  of  our  runtime 
support  in  context  of  multiblock  and  multigrid  applications. 

2.1  Multiblock  and  Multigrid  Applications 

For  a  typical  multiblock  application,  the  main  body  of  the  program  consists  of  an  outer  sequential  (time 
step)  loop,  and  an  inner  parallel  loop.  The  inner  loop  iterates  over  the  blocks  of  the  problem,  after  applying 
boundary  conditions  to  all  the  blocks  (including  updating  interfaces  between  blocks).  Applying  the  boundary 
conditions  involves  interaction  (communication)  between  the  blocks.  In  the  inner  loop  over  the  blocks,  the 
computation  in  each  block  is  typically  sweeps  over  mesh  in  which  each  mesh-point  interacts  only  with  its 
nearby  neighbors.  Since  in  these  applications,  there  are  computational  phases  which  involve  interactions  only 
within  each  block,  communication  overheads  are  reduced  if  each  block  is  not  divided  across  a  large  number  of 
processors.  So,  blocks  may  have  to  be  distributed  onto  subsets  of  processor  space.  Since  the  number  of  blocks 
is  typically  quite  small  (i.e.  at  most  a  few  dozens),  at  least  some  of  the  blocks  will  have  to  be  distributed 
across  multiple  processors.  Partitioning  of  the  parallel  loop  across  the  block  is  the  source  of  the  coarse-grained 
parallelism  for  the  application.  Furthermore,  within  each  iteration  of  the  inner  loop  fine-grained  parallelism 
is  available  in  the  form  of  (large)  parallel  loops,  iterating  over  the  elements  of  the  blocks. 


Now,  we  briefly  discuss  the  runtime  support  required  for  parallelizing  multiblock  applications.  First,  there 
must  be  a  means  for  expressing  data  layout  and  organization  on  the  processors  of  the  distributed  memory 
parallel  machine.  We  need  compiler  and  runtime  support  for  mapping  blocks  (arrays)  to  subsets  of  the 
processor  space.  Second,  there  must  be  methods  for  specifying  the  movement  of  data  required.  Two  types  of 
communication  are  required  in  multiblock  applications.  The  interaction  between  different  blocks  requires  the 
movement  of  regular  array  sections.  The  inner  loop  involves  interactions  among  neighboring  elements  of  the 
grids.  Since  blocks  may  be  partitioned  across  processors,  this  also  requires  communication.  Third,  there  must 
be  some  way  of  distributing  loops  iterations  among  the  processors  and  converting  global  distributed  arrays 
references  to  local  references. 

Multigrid  is  a  common  technique  for  accelerating  the  solution  of  partial  differential  equations.  Multigrid 
codes  employ  a  number  of  meshes  at  different  levels  of  resolution.  Multigrid  codes  have  phases  of  restriction 
(in  which  a  coarse  grid  is  initialized  based  upon  a  finer  grid),  prolongation  (in  which  a  coarse  grid  is  copied  into 
a  finer  grid  with  non-unit  stride  and  then  the  other  elements  on  the  fine  grid  are  computed  by  interpolation) 
and  sweeps  over  individual  grids.  Coarse  meshes  may  be  obtained  from  fine  grid  by  coairsening  by  the  same 
factor  or  different  factors  along  different  dimensions.  Accordingly,  each  grid  may  be  distributed  over  the  entire 
set  of  processors,  or  some  grids  may  have  to  be  distributed  over  parts  of  the  processor  space.  A  particular 
form  of  multigrid  technique  is  the  semi-coarsening  multigrid  technique  [34].  Semi-coarsening  multigrid  works 
as  follows.  Starting  from  the  finest  grid,  coarser  grids  are  generated  by  coarsening  by  different  factors  along 
different  dimensions.  There  may  be  many  grids,  having  the  same  number  of  mesh  points,  but  obtained  by 
different  coarsening  factor  along  each  dimensions  (i.e.  they  may  have  different  shapes).  In  parallelizing  such 
an  application,  communication  costs  can  be  reduced  while  maintaining  good  load  balance  by  the  following 
mapping  scheme.  The  finest  grid  is  mapped  over  the  entire  processor  space.  Different  grids  at  the  same  level 
(i.e.  having  the  same  total  number  of  mesh  points  but  obtained  by  different  coarsening  factors  along  each 
dimension)  are  mapped  to  disjoint  parts  of  the  processor  space.  In  Figure  2  we  show  how  the  semi-coarsened 
grids  are  generated  and  how  they  can  be  mapped  to  the  processor  space. 

The  runtime  support  requirements  for  the  multigrid  codes  is  as  follows.  As  with  multiblock  codes,  we  need 
to  be  able  to  map  grids  over  subsets  of  the  processor  space.  The  restriction  and  prolongation  steps  require 
regular  section  moves  between  grids  at  different  levels  of  resolution.  Again,  communication  arises  because  each 
grid  may  be  distributed  across  multiple  processors  and  computation  within  each  block  requires  near  neighbor 
interactions.  Similarly,  the  interpolation  required  during  the  prolongation  step  also  involves  interaction  among 
neighboring  elements.  Also,  support  for  distributing  loop  iterations  and  transforming  global  distributed  array 
references  to  local  references  is  required. 

2.2  Multiblock  Parti  Primitives 

We  now  discuss  the  design  of  our  runtime  library  [39].  Since,  in  typical  multiblock  and  multigrid  applications, 
the  number  of  blocks  and  their  respective  sizes  is  not  known  until  runtime,  the  distribution  of  blocks  onto 
processors  is  done  at  runtime.  The  distributed  array  descriptors  (DAD)  [4]  for  the  arrays  representing  these 
blocks  are,  therefore,  generated  at  runtime.  Distributed  array  descriptors  contain  information  about  the 
portions  of  the  arrays  residing  on  each  processor,  and  are  used  at  runtime  for  performing  communication  and 
distributing  loops  iterations.  We  will  not  discuss  the  details  of  the  primitives  which  allow  the  user  to  specify 
data  distribution.  For  more  details,  see  [39]. 

As  we  discussed  previously,  two  types  of  communication  are  required  in  both  multiblock  and  multigrid 
applications.  Inter-block  communication  is  required  because  of  boundary  conditions  between  blocks  (in  multi¬ 
block  codes)  and  restrictions  and  prolongations  between  grids  at  different  levels  of  resolution  (in  multigrid 
codes).  Since  the  data  that  needs  to  be  communicated  is  always  a  regular  section  of  an  array,  this  can  be 
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Figure  2:  Semi-coarsened  grids  and  their  mapping  to  processors 
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handled  by  primitives  for  regular  section  move.  A  regular  section  move  copies  a  regular  section  of  one  dis¬ 
tributed  array  into  regular  section  of  another  distributed  array,  potentially  involving  changes  of  offset,  stride 
and  index  permutation.  Intra-block  communication  is  required  because  of  partitioning  of  blocks  or  grids  across 
processors.  The  data  access  pattern  in  the  computation  within  a  block  or  grid  is  regular.  This  implies  that 
the  interaction  between  grid  points  is  restricted  to  nearby  neighbors.  The  interpolation  required  during  the 
prolongation  step  in  multigrid  codes  also  involves  interaction  among  the  neighboring  array  elements.  Such 
communication  is  handled  by  allocation  of  extra  space  at  the  beginning  and  end  of  each  array  dimension  on 
each  processor.  These  extra  elements  are  called  overlap,  or  ghost,  cells  [6,  16,  27].  Depending  upon  the  data 
access  pattern  in  a  loop,  the  required  data  is  copied  from  other  processors  and  is  stored  in  the  overlap  cells. 

In  our  runtime  system,  communication  is  performed  in  two  phases.  First,  a  subroutine  is  called  to  build 
a  communication  schedule  that  describes  the  required  data  motion,  and  then  another  subroutine  is  called  to 
perform  the  data  motion  (sends  and  receives  on  a  distributed  memory  parallel  machine)  using  a  previously 
built  schedule.  Such  an  arrangement  allows  a  schedule  to  be  be  used  multiple  times  in  an  iterative  algorithm. 
The  communication  primitives  include  a  procedure  Overlap.CellJ'ill^ched,  which  computes  a  schedule  that 
is  used  to  direct  the  filling  of  overlap  cells  along  a  given  dimension  of  a  distributed  array.  Communication  for 
filling  in  the  overlap  cells  has  been  implemented  in  other  systems  for  regular  computations  [6,  16,  27],  so  we 
will  not  be  discussing  the  details  here.  The  primitive  Regular^ection.CopyJSched  carries  out  the  preprocessing 
required  for  performing  the  regular  section  moves.  In  section  3,  we  discuss  the  details  of  the  regular  section 
analysis  required  for  efficiently  generating  the  schedule  for  regular  section  move. 

The  schedules  produced  by  Overlap.CellJ'illJSched  and  RegularJSection.CopyJSched  are  employed  by  a 
primitive  called  Data.Move  that  carries  out  both  interprocessor  communication  (sends  and  receives)  and 
intra-processor  data  copying. 

The  final  form  of  support  provided  by  the  multiblock  Parti  library  is  to  distribute  loop  iterations  and 
transform  global  distributed  arrays  references  into  local  references.  Our  library  distributes  loop  iterations 
based  upon  the  owners  compute  rule,  which  means  that  a  particular  loop  iteration  is  executed  by  the  processor 
owning  the  left-hand  side  array  element  written  into  during  that  iteration.  As  we  discussed  earlier,  we  prefer 
to  generate  the  Distributed  Array  Descriptor  (DAD)  for  the  arrays  at  runtime.  This  means  that  global  indices 
can  be  translated  to  local  indices  only  at  runtime  and  not  at  compile-time.  Two  primitives,  LocaLLowerJound 
and  Local-Upper Jound,  are  provided  for  transforming  loop  bounds  (returning,  respectively,  the  local  lower 
and  upper  bounds  of  a  given  dimension  of  the  referenced  distributed  array)  based  upon  the  owners  compute 
rule.  Primitives  globaLioJocal  and  localJo-global  are  also  available  for  translating  a  global  index  into  local 
index  and  translating  a  local  index  on  a  processor  into  global  index  respectively. 

3  Regular  Section  Analysis 

In  this  section  we  discuss  the  regular  section  analysis  required  for  efficiently  generating  schedules  for  regular 
section  moves  (i.e.  for  implementing  the  primitive  Regular JSeciion-CopySched).  By  regular  section  analysis 
we  mean  how  each  processor  can  determine,  for  each  other  processor,  the  exact  parts  of  the  distributed  array 
it  needs  to  send  and  receive,  given  the  source  and  the  destination  regular  sections  in  global  coordinates.  In  our 
current  system,  this  analysis  is  always  done  at  runtime.  However,  if  the  distributions  of  source  and  destination 
distributeed  arrays  and  description  of  source  and  destination  regular  sections  are  available  at  compile-time, 
then  this  analysis  can  be  done  at  compile-time  as  well.  In  separate  works,  Chatterjee  ei  al  [7],  Stichnoth  [37] 
and  Gupta  et  al  [17]  have  developed  compile-time  methods  for  analyzing  and  generating  communication 
associated  with  HPF’s  forall  statements  and/or  F90  style  array  expressions.  While  their  solutions  work  for 
the  general  case  when  the  data  distributions  are  block-cyclic,  their  methods  require  that  the  data  distribution 
be  known  at  the  compile-time  and  the  exact  description  of  the  statement  be  available  in  terms  of  compile-time 
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constants.  We  are  interested  in  techniques  which  can  be  used  at  runtime  and  are  specialized  towards  the 
particular  communication  patterns  associated  with  multigrid  and  multiblock  applications.  Also,  since  these 
applications  typically  need  block  distribution,  we  restrict  to  describing  our  analysis  when  the  data  is  block 
distributed  along  each  dimension. 

We  assume  that  the  array  indices  always  start  from  0  (as  in  the  C  programming  language).  The  processors 
owning  source  (or,  destination)  array  can  be  viewed  as  forming  an  r-dimensional  virtual  processor  grid.  A 
processor  p  in  this  processor  grid  has  coordinates  (pi.p^i . . .  ,Pr}-  We  assume  that  the  numbering  starts  from 
zero  in  each  dimension  in  the  processor  grid. 

The  source  regular  section,  denoted  by  S,  is  part  of  the  source  distributed  array  s. 

S  =  {(sJoi  :  sMi  :  s.siri),  (sJo^  ■  S-hi^  :  . . . ,  {sJor  ;  S-hiV  :  s_s<rr)} 

sJoi,  s.hii  and  s-slrj  are  respectively  the  lowest  index,  highest  index  and  the  stride  along  the  t**  dimension 
(in  global  indices).  The  regular  section  S  defines  a  set  of  array  elements.  An  array  index  et  along  the  t*^ 
dimension  is  said  to  be  a  part  of  the  regular  section  iff  B/,  (an  integer)  s.t. 


Ci  =  sJoi  +  /,•  •  s_str,- , 


(3.0.1) 


where. 


0  <  /•  < 


sJiii  —  sJoi 
s.stri 


An  array  element  whose  indices  are  (cj ,  62, . . . ,  e,)  belongs  to  the  regular  section  S  iff, 

^  •>  1  ^  £  ’■)  •'^1®  array  index  e,-  along  the  dimension  belongs  to  the  regular  section  S. 

We  will  use  this  format  to  describe  all  regular  sections.  The  destination  regular  section,  denoted  by  V,  is 
a  part  of  the  destination  array  d. 


V  =  {(dJoi  :  dJii\  :  d^tri),  (dJo2  ■  dJhi^  :  d-sfr2), . . . ,  (dJor  :  dJiir  :  dMvr)) 

Regular  section  moves  can  involve  index  permutations.  We  denote  by  tm(i)  the  destination  dimension 
which  is  accessed  by  the  same  loop  index  as  the  »**  source  dimension. 

For  each  processor  owning  part  of  the  source  regular  section,  we  want  to  determine  the  set  of  local  elements 
that  it  will  be  sending  to  each  processor  owning  part  of  the  destination  regular  section.  We  call  these  sets  of 
elements  send  sets.  Similarly,  for  each  processor  owning  part  of  the  destination,  we  want  to  determine  the  set 
of  local  elements  that  it  will  be  receiving  from  each  processor  owning  part  of  the  source.  We  call  these  sets  of 
elements  receive  sets.  Here  we  just  discuss  the  analysis  that  a  particular  source  processor  does  to  compute  the 
send  sets.  The  analysis  for  determining  the  receive  sets  is  completely  analogous  and  is  therefore  not  described. 

The  steps  we  follow  for  computing  the  send  sets  are  as  follows.  For  a  processor  p,  we  determine  the  part 
of  regular  section  S  that  it  owns,  that  is,  we  restrict  the  section  S  on  the  processor  p  (which  is  denoted  by 
5'(p)).  Next,  we  take  a  transformation  of  the  section  S'{p))  to  map  it  from  the  source  regular  section  S 
to  the  destination  regular  section  V.  The  resulting  section  is  denoted  by  T>'{p).  We  next  determine  the  set 
of  destination  processors  which  own  part  of  the  section  V'{p),  i.e.  the  destination  processors  to  whom  the 
processor  p  will  be  communicating.  For  each  such  processor  q,  we  restrict  the  section  P'(p)  to  determine  the 
part  that  q  owns,  calling  it  V"{p,q).  In  the  last  step,  the  section  'D"{p,q)  is  mapped  back  to  the  source,  the 
resulting  section  is  denoted  by  S"{p,q).  S"{p,q)  is  the  send  set  that  the  source  processor  p  will  be  sending  to 
the  destination  processor  q. 

We  now  present  the  details  of  the  steps  mentioned  above.  We  consider  a  particular  processor  p  which  owns 
part  of  the  source  array  s,  of  which  the  regular  section  5  is  a  part.  Let  llof{p)  and  lhif{p)  be  the  lowest 
and  the  highest  points  along  the  dimension  (in  global  indices)  that  the  processor  p  owns.  Since  the  data 
distribution  is  block,  the  processor  p  owns  a  contiguous  chunk  of  data  from  //of(p)  to  lhif{p). 
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3.1  Restricting  S  to  the  Processor  p 

We  now  compute  the  part  of  the  regular  section  S  that  the  processor  p  owns.  This  is  denoted  by  S*{p). 
Through-out  our  discussion,  all  regular  sections  will  always  be  described  in  global  indices.  Given  the  global 
coordinates,  auiy  individual  processor  can  always  determine  the  corresponding  local  (on  processor)  indices. 


5'(P)  =  {(»-^oi(p),  aJii\{p), s_str',(p)), . . . ,  s^tr;(p))} 


where. 


8Jo'i(p)  =  aJoi  +  - ^  ^ ^  •  aMn 

(3.1.1) 

s_hij(p)  =  Tnin{lhif(p),8jiii) 

(3.1.2) 

sjBtrf{p)  =  a_sfri 

(3.1.3) 

Since  the  data  is  block  distributed,  s-str|.(p)  is  sjatti.  8Jo{{p)  is  the  first  index  along  the  dimension 
which  is  part  of  the  regular  section  S  and  is  owned  by  the  processor  p.  In  the  calculation  of  sJo|(p)  there  are 
two  cases,  depending  upon  whether  sJoj  >  Hc^(p)  or  sJoi  <  llof(p).  If  sJoj  >  Hof(p)  then  the  index  sJoj 
is  on  the  processor  p.  Therefore,  sJo^{p)  =  sJoi.  Alternatively,  if  sJoi  <  //of(p),  then  the  expression  for 
sJoi(p)  reduces  to  «_/o,  -f  \(llof(p)  —  s-/oj)/s_str,'|  •  sstri.  This  is  the  first  index  after  llof(p)  which  is  part 
of  the  regular  section.  Note  that  s_/it|(p)  is  not  necessarily  a  member  of  the  regular  section  S.  It  just  needs 
to  be  greater  than  the  last  member  of  regular  section  on  the  processor  q,  so  that  the  loop  accessing  successive 
indices  in  the  section  terminates  correctly. 

If  the  processor  p  does  not  own  any  part  of  the  regular  section  along  the  dimension,  then  the  above 
expressions  will  give  a  value  of  s-/o<(p)  which  is  greater  than  the  value  of  8jii[{p).  In  general,  if 


then  the  processor  p  does  not  own  any  part  of  the  regular  section  S. 


3.2  Mapping  S\p)  to  the  Destination 

Next,  we  determine  the  corresponding  section  in  the  destination  array  (i.e.  put  of  the  destination  regular 
section  that  will  be  received  from  the  processor  p).  This  is  denoted  by  V{p). 

=  {(<<-^01  (P).  ‘^-^«Up)><^-**»''i(p)).  •  •  •  >  '^-A»r(P).  '<-»<»'r(P))} 


where. 


j  =  im(i) 

(3.2.1) 

dJo'Ap)  =  dJoj  -k-  djitrj 

’  ^  s-sfrj 

(3.2.2) 

=  ./Jo, + 

(3.2.3) 

djatr'j{p)  =  djBtrj 

(3.2.4) 

Since  the  array  is  distributed  by  blocks,  dMr'^{p)  is  dMrj.  The  expressions  for  dJo'^{p)  and  dJiij{p) 
follow  from  finding  the  indices  in  the  destination  regular  section  which  correspond  to  the  indices  sJo|(p)  and 
s-hi|(p),  respectively,  in  the  source  regular  section. 
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3.3  Restricting  V{p)  to  Processor  q 

We  first  determine  the  set  of  processors  to  which  the  source  processor  p  will  send  data.  The  processors  owning 
parts  of  the  destination  array  form  an  r-dimensional  virtual  processor  grid.  A  processor  q,  owning  part  of  the 
destination  array,  has  coordinates  {gi,g2i  •  •  ijr}  in  this  processor  grid.  We  assume  that  each  processor  owns 
size,  indices  adong  the  t**  dimension. 

We  denote  by  qjniTii{p)  and  qjnaxi(p)  the  lowest  and  highest  coordinates  along  the  dimension  of  the 
processors  which  own  part  of  the  regular  section  ‘D'(p). 


qjmini{p)  =  -  1  (3.3.1) 

I  StZCi  I 

qMiaxiip)  =  +  n  _  1  (3.3.2) 

I  StZCi  \ 

A  processor  q  having  coordinates  {qi,qi,  ■  •  -  will  receive  data  from  the  source  processor  p  iff 

V*.  ?-»w»««(p)  <  9t  <  qjnaxi{p) 

Consider  a  particular  processor  q  which  will  receive  data  from  the  source  processor  p.  Suppose  that  the 
start  and  end  points  along  the  i‘*  dimension  on  this  processor  are  llof(q)  and  lhif{q)  respectively.  We  denote 
the  part  of  the  destination  regular  section  that  the  processor  q  will  receive  from  the  processor  p  by  'D"{p,  q). 

9)  =  {(‘^-/oi'(p,  9).  dJii'{(p,  q),djBtr'{{p,  ?)), ....  (dJo"(p,  q),  dJii'^{p,  q),  dMr'^{p,  g))} 


where, 


djo;'(p,9) 

dJ«T(p.9) 
d-str"(p,  q) 


dM  + 


majc(0,  Uofjq)  -  dJo'jjp)) 


djstvi 
min{lhif{q),dJiii{p)) 
d^tvi 


■  dMvi 


(3.3.3) 

(3.3.4) 

(3.3.5) 


The  reasoning  behind  the  correctness  of  the  above  expressions  is  the  same  as  that  used  in  determining  S' 
from  S,  as  discussed  in  Section  3.1. 


3.4  Mapping  T>"{p,q)  to  the  Source 

Next,  we  determine  the  equivalent  part  of  the  regular  section  V"{jp,q)  on  the  source  side  (i.e.  the  part  of  the 
source  regular  section  which  the  processor  p  sends  to  the  processor  q).  We  denote  this  by  S"{p,q). 

«5"(P.  9)  =  {(«-^Oi' (P.  9),  sJii'l (p,  9).  sMr'{{p,  q)), . . . ,  (sJo"(p,  q),  sJii''{p,  q),  s-sfr"(p,  g))} 

where, 


j  =  im(i) 

(3.4.1) 

„  <iJo''(p,  q)  —  dJoj 

sJo'/{p,q)  =  sJoi  +  ^  dMr- 

(3.4.2) 

„  1  <iM'l{p,  g)  -  dJoj  1 

s-ht'i  (p,  g)  =  sJoi  +  ^  J  •  s.stri 

(3.4.3) 

s^tr"{p,  g)  =  s_s<r, 

(3.4.4) 

The  reasoning  behind  the  correctness  of  the  these  expressions  is  the  same  as  that  used  in  determining  V'{p) 
from  S'{p)  in  Section  3-2. 
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3.5  Discussion 

All  the  calculations  described  above  are  performed  by  the  processor  p  locally  amd  do  not  involve  communication 
with  any  other  processor.  Therefore,  send  and  receive  sets  can  be  generated  efficiently.  Based  on  the  calculation 
of  S",  the  processor  p  knows  the  contents  of  the  message  that  it  must  send  to  processor  q.  However,  when 
processor  q  receives  this  message,  it  does  not  have  any  information  about  which  local  memory  locations  each 
element  of  the  message  must  be  copied  into.  To  facilitate  this,  each  destination  processor  computes  the  set 
of  (local)  elements  that  it  will  receive  from  each  source  processor.  The  calculations  for  computing  these 
receive  sets  are  completely  analogous  to  the  computations  for  the  send  sets.  Therefore,  we  do  not  describe 
the  computation  of  the  receive  sets  here.  The  source  processor  p  always  sends  the  set  of  elements  it  needs  to 
send  to  the  processor  q  in  a  single  message,  packed  in  the  column  major  fashion.  Processor  q  can  then  use  the 
receive  set  information  to  copy  the  elements  in  the  received  message  into  the  appropriate  local  elements. 

An  alternative  to  this  scheme  is  that  the  message  sent  by  the  source  processor  p  also  contains  information 
about  what  local  memory  location  at  the  destination  processor  q  each  of  elements  packed  in  the  message  needs 
to  be  copies  to.  The  destination  processor  q  can  then  copy  elements  of  the  message  into  its  local  elements  based 
upon  this  information.  This  approach  does  save  some  computation  at  the  destination  processors.  However, 
the  size  of  the  messages  increases  significantly  because  of  tbe  extra  information  that  needs  to  be  sent.  In 
our  implementation,  we  have  chosen  to  compute  both  the  send  and  receive  sets,  since  on  current  distributed 
memory  machines,  this  is  less  expensive  than  communicating  the  receive  set  information. 

3.6  Example 

Consider  a  regular  section  move  that  involves  a  source  array  of  size  100  *  100  and  a  destination  array  of  size 
50  *  100.  The  source  array  is  block  distributed  over  a  2  ♦  2  virtual  processor  grid  and  the  destination  array 
is  block  distributed  over  a  4  *  1  virtual  processor  grid.  The  source  and  the  destination  regular  sections  are: 
S  =  {(10  :  60  :  2),  (10  ;  70  :  3)}  and,  T>  =  {(10  :  30  :  1),  (5  :  80  :  3)}.  The  first  dimension  of  the  source  regular 
section  is  aligned  to  the  second  dimension  of  the  destination  regular  section  and  the  second  dimension  of  the 
source  regular  section  is  aligned  to  the  first  dimension  of  the  destination  regular  section  i.e.  tm(l)  =  2  and 
im(2)  =  1. 

We  consider  the  source  processor  p  with  coordinates  {1,0}.  The  part  of  the  global  array  that  this  processor 
owns  is  //of  (p)  =  50,  lhif{p)  =  99,  //of (p)  =  0  and  /Atf(p)  =  49. 

The  part  of  the  source  regular  section  that  processor  p  owns  (5'(p))  is  given  by 

5'(p)  =  {(50:60:2),(10:49:3)} 

The  corresponding  section  on  the  destination  side  {V'{p))  is  given  by 

P'(p)  =  {(10:23:l),(65:80:3)} 

Next,  the  processor  p  determines  the  set  of  destination  processors  with  whom  it  will  be  communicating. 
We  have  for  the  destination  array,  sizei  =  50  and  stzej  =  25. 

This  gives,  p_mtni(p)  =  0,  p_maii(p)  =  0,  p_min2(p)  =  2,  and  p_max2(p)  =  3.  The  destination 
processors  the  source  processor  p  will  communicate  with  are  tbe  ones  with  grid  coordinates  {0,2}  and  {0,3}. 
Consider  the  destination  processor  q  with  coordinates  {0,3}.  The  part  of  the  destination  array  that  processor 
q  owns  is  given  by  lloi{q)  =  0,  lhi^{q)  =  49,  Uo^iq)  =  75  and  lhi^{q)  =  99. 

The  part  of  regular  section  which  the  processor  p  will  be  sending  to  the  processor  q  {P"{p,  ?))  is  given  by 

^>"(p.9)  =  {(10:23:1),(77:80:3)} 
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The  corresponding  source  section  (S"(p,  q))  is  now  given  as 


S"(p,q)  =  {(58  :  60  ;  2),  (10  :  49  :  3)} 


4  Compiler  Support 

In  this  section  we  first  discuss  the  additional  functionality  required  in  the  current  version  of  HPF  to  support 
multiblock  and  multigrid  codes.  We  describe  how  a  compiler  can  analyze  the  data  access  patterns  associated 
with  a  loop,  to  recognize  communication  patterns  which  can  be  handled  using  the  runtime  primitives  for 
multiblock  problems.  We  then  describe  the  compiler  transformations  for  generating  the  calls  to  these  runtime 
primitives.  We  also  briefly  discuss  how  loop  iterations  are  distributed  to  achieve  parallelism. 

4.1  Language  Support 

The  current  version  of  HPF  does  not  support  all  the  functionality  required  for  multiblock  and  multigrid  ap¬ 
plications.  In  multiblock  problems,  the  problem  geometry  is  divided  into  a  number  of  blocks  of  different  sizes. 
As  we  have  discussed  in  the  previous  sections,  each  of  these  blocks  needs  to  be  distributed  onto  a  portion 
of  the  processor  space.  Similarly,  in  multigrid  codes,  communication  overheads  can  typically  be  reduced  by 
distributing  each  coarse  grid  over  a  part  of  the  processor  space  [35].  The  current  version  of  HPF  does  not 
provide  any  convenient  mechanism  for  distributing  arrays  (or  templates)  onto  a  part  of  the  processor  space. 
We  therefore  need  additional  functionality  for  conveniently  distributing  arrays  onto  part  of  the  processor  space. 
In  HPF,  the  programmer  declares  an  abstract  processor  space  by  using  the  processor  directive: 

!HPF$  PROCESSORS  P(N) 

In  general,  the  abstract  processor  space  can  have  any  number  of  dimensions.  To  support  block  structured 
applications,  we  need  to  be  able  to  specify  processor  subspaces.  We  declare  a  processor  subspace  as  follows: 

!HPF$  PSUBSPACE  PI  IS  P(LB:UB) 

The  above  directive  states  that  PI  is  the  part  of  the  processor  space  P  which  starts  at  processor  LB  and 
ends  at  processor  UB.  In  addition,  if  the  processor  subspace  PI  is  created  from  the  processor  space  P,  then 
PI  must  have  the  same  number  of  dimensions  as  P.  Since  the  sizes  of  the  blocks  are,  in  general,  not  known  at 
compile-time,  the  subspace  directive  must  be  executable,  so  that  the  parameters  do  not  have  to  be  compile¬ 
time  constants.  Once  a  processor  subspace  has  been  declared,  arrays  or  templates  can  be  distributed  onto  it. 
For  example, 

!HPF$  TEMPLATE  T(100,100) 

!HPF$  DISTRIBUTE  T(BLOCK, BLOCK)  ONTO  PI 

Similar  functionality  is  available  in  Vienna  Fortran  [41],  where  a  distribution  can  be  mapped  to  a  processor 
reference.  We  prefer  to  explicitly  name  the  processor  subspace  since  it  makes  it  easier  to  detect  at  compile-time 
which  arrays  or  templates  have  been  mapped  to  the  same  processor  subspace,  even  if  the  exact  size  of  the 
subspace  is  not  available  as  compile-time  constants. 

With  the  block  distributions  supported  in  the  current  version  of  HPF,  the  entire  array  gets  distributed 
uniformly  across  the  processors  of  the  distributed  memory  parallel  machine.  This  may  not  be  ideal  for  load 
balancing  for  many  applications.  While  the  programmer  may  declare  a  large  array,  not  all  the  elements  of  the 
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array  may  be  actual  mesh  points  participating  in  computation.  Some  of  the  array  elements  at  both  ends  of 
each  dimension  may  be  used  for  participating  in  exchanges  between  blocks.  We  refer  to  such  array  elements 
as  eztemai  ghost  cells.  For  example,  the  actual  declared  arrays  for  a  given  block  may  be  52  x  12  x  12,  with  two 
external  ghost  cells  at  the  beginning  and  end  of  each  dimension.  This  means  that  the  actual  mesh  representing 
the  computation  is  of  size  48  x  8  x  8.  It  is  these  mesh  points  which  must  be  distributed  evenly  across  the 
processors  onto  which  the  block  is  distributed,  so  that  the  computation  load  will  be  evenly  balanced.  The 
external  ghost  cells  at  both  ends  of  each  dimension  are  then  stored  at  the  first  and  last  processor  along  that 
dimension  in  the  processor  space.  For  example,  if  an  array  with  8  elements,  plus  two  external  ghost  cells  on 
each  end  (for  a  total  of  12  elements),  is  distributed  on  4  processors,  we  would  like  to  store  2  mesh  points  on 
each  processor  along  that  dimension.  The  first  and  last  processors  can  then  store  the  external  ghost  cells  at 
the  beginning  2uid  end,  respectively.  This  results  in  a  much  better  load  balance  than  simply  distributing  3 
array  elements  onto  each  processor  (which  will  result  in  the  first  and  last  processors  having  only  1  real  mesh 
point  and  the  intermediate  processors  having  3  real  mesh  points  each). 

The  current  version  of  HPF  does  not  provide  any  mechanism  for  specifying  external  ghost  cells.  We  need 
additional  functionality  in  the  align  statement  to  express  them.  We  do  this  by  explicitly  specifying  the  number 
of  external  ghost  cells  at  the 'beginning  and  end  of  each  dimension: 

!HPF$  DIMENSION  A(105,105) 

!HPF$  ALIGN  A(iJ)  WITH  T(i:2:3J:2:3) 

This  example  says  that  an  array  of  size  105x105  is  aligned  along  a  template  of  size  100x100,  with  2  ex¬ 
ternal  ghost  cells  at  the  beginning  of  each  dimension  and  3  external  ghost  cells  at  the  end  of  each  dimension. 
If  the  template  T  is  distributed  by  blocks  onto  a  two  dimensional  processor  sp^e,  A(3: 102,3:102)  also  gets 
distributed  in  the  same  fashion.  Note  that  our  purpose  here  is  not  to  introduce  new  syntax  but  to  achieve 
the  additional  functionality  that  we  need.  We  believe  that  this  functionality  will  be  added,  in  some  form,  in 
a  future  version  of  HPF, 

4.2  Identifying  Communication  Patterns 

In  this  subsection  we  discuss  how  the  compiler  identifies  the  communication  patterns  which  can  be  handled 
using  the  runtime  support  for  multiblock  problems.  Note  that  our  purpose  is  not  to  provide  a  general  framework 
for  compiling/ora/Istatements;  we  are  only  interested  in  recognizing  the  patterns  that  can  be  handled  efficiently 
using  the  primitives  we  have  developed. 

While  we  have  designed  the  runtime  support  with  multiblock  and  multigrid  codes  in  mind,  the  runtime 
primitives  can  also  be  used  to  efficiently  handle  communication  for  many  other  types  of  applications.  Regular 
section  move  primitives  can  be  used  for  handling  the  communication  required  when  a  distribution  of  an  array 
is  changed  (using  the  redistributed  statement  of  HPF  [12].  They  can  also  be  used  to  handle  the  communication 
required  for  filling  ghost  cells  when  the  data  distribution  is  cyclic  or  block-cyclic.  The  primitives  can  also  be 
used  for  handling  communication  in  forall  loops  and  array  expressions  in  many  regular  applications,  especially 
when  strides  are  involved. 

We  do  not  consider  applications  in  which  indirection  arrays  may  need  to  be  analyzed  to  identify  commu¬ 
nication  patterns.  The  irregular  communication  arising  from  use  of  indirection  arrays  can  be  handled  using 
the  Parti  primitives  for  irregular  problems  [10],  which  have  also  been  integrated  with  compilers  for  HPF-style 
languages  (including  the  Rice  University  Fortran  77D  compiler  [18]  and  the  Syracuse  University  Fortran  90D 
compiler  [4]).  F90D  and  HPF  also  provide  a  number  of  intrinsic  functions  (such  as  reduction,  spread,  etc.). 
We  assume  that  if  a  computation  can  be  done  using  these  intrinsics,  it  is  either  written  this  way  by  the 


12 


programmer  or  is  recognized  by  the  compiler  in  an  early  phase  of  the  compilation. 

HPF  diows  multiple  statement  forall  loops  and  array  expressions  for  expressing  parallelism.  We  restrict  our 
discussion  to  the  problem  of  analyzing  a  single  statement  forall  loop  for  communication  patterns.  A  multiple 
statement  forall  loop  is  just  like  a  single  statement  forall  loop,  with  the  multiple  assignment  statements  all 
having  the  same  loop  header.  The  same  analysis  can  be  done  for  each  assignment  statement  within  the  forall 
loop.  The  array  expressions  provided  by  HPF  can  always  be  translated  into  equivalent  forall  loops. 

We  classify  the  data  access  patterns  associated  with  a  forall  loop  as  being  one  of  three  kinds: 

•  Completely  regular  (not  involving  any  communication). 

•  Ones  that  can  be  handled  by  filling  in  overlap  cells. 

•  Ones  that  requires  regular  section  moves. 

Consider  any  forall  statement  with  array  expressions  involving  an  array  A  in  the  left  hand  side  and  an 
array  B  as  one  of  the  arrays  on  the  right  hand  side.  The  form  of  the  forall  statement  is  assumed  to  be  as  follows: 

forall  (ii  =  loi  :  hi  y  :  six,  ,  i„,  =  lom  :  hi^  ■  «<m) 

A(/l,  /2,  .  .  =  •  •  B(5i,  -  - 

The  iic,  {k  =  l..m)  are  the  loop  variables  associated  with  the  forall  statement,  lot,  hit  and  stt  are 
respectively  the  lower  bound,  upper  bound  and  the  stride  for  each  loop  variable.  For  the  left  hand  side  array 
A,  /i,  fi,  ..  ,fj  are  the  subscripts.  Similarly,  for  the  right  hand  side  array  B,  pj,  §2,  ..  ,gn  are  the  subscripts. 
The  form  of  the  array  subscripts  /  and  g  is  assumed  to  be: 

ft  =  clti'U  +  dlt 

gk  =  c2ti2t  +  d2t 

Here,  flit  and  i2t  are  loop  variables.  If  a  subscript  is  a  loop  invariant  scalar,  then  we  say  that  the  loop 
variable  il*  (or,  i2t  ),  is  4>  and  cl*  (or,  c2*)  is  0.  cl*,  c2*,  dl*  and  d2t  may  be  expressions,  but  we  assume 
that  they  do  not  involve  any  loop  variable.  Our  primitives  are  not  applicable  for  cases  in  which  multiple  loop 
variables  are  associated  with  a  particular  array  subscript  or  when  the  same  loop  variable  appears  in  more  than 
one  subscript  for  a  particular  array  or  when  a  subscript  is  a  higher  order  function  of  a  loop  variable.  Such 
cases  can  be  handled  by  using  the  Parti  primitives  for  irregular  problems.  Also,  the  HPF  specification  allows 
the  lower  bound,  upper  bound  and  stride  expressions  for  each  loop  variable  to  be  evaluated  in  any  order. 
Consequently,  the  lower  bound,  upper  bound  and  stride  for  any  loop  variable  are  not  allowed  to  be  a  function 
of  any  other  loop  variable.  It  is  possible,  in  general,  that  a  loop  variable  may  appear  only  in  the  right  hand 
side  array  or  only  in  the  left  hand  side  array.  If  a  particular  loop  variable  appears  only  in  the  right  hand  side, 
this  represents  successive  overwrites  on  the  same  memory  location  of  the  left  hand  side  array.  Such  code  is  not 
likely  to  appear  in  practice  and  therefore,  we  do  not  consider  this  case.  If  a  particular  loop  variable  appears 
only  in  the  left  hand  side  array,  this  represents  a  spread  operation.  We  ^lssume  that  it  is  written  using  the 
intrinsic  spread  operation,  and  is  not  a  part  of  the  forall. 

Depending  upon  how  the  arrays  A  and  B  are  distributed  and  aligned  with  respect  to  each  other,  we  consider 
three  different  cases.  These  are: 

Case  I:  Arrays  A  and  B  are  aligned  to  different  templates. 

C^ase  II:  Arrays  A  and  B  are  identically  aligned  to  the  same  template.  This  case  also  requires  that  A 
and  B  are  arrays  of  identical  shape  and  size,  i.e.  having  the  same  number  of  dimensions  and  the  same 
size  in  each  dimension. 


13 


Case  III:  Arrays  A  and  B  are  adigned  to  the  same  template,  but  with  a  different  stride,  offset  and/or 
index  permutation.  This  means  that  the  the  arrays  A  and  B  are  mapped  to  the  same  processor  subspace, 
but  each  in  a  different  manner. 

We  now  discuss  how  the  data  access  pattern  associated  with  each  of  these  cases  is  analyzed.  The  transfor¬ 
mations  required  for  generating  calls  to  the  runtime  primitives  are  discussed  in  Section  4.3. 

4.2.1  Case  I 

Since  the  arrays  ue  aligned  to  distinct  templates,  the  communication  is  always  handled  using  the  regular 
section  move  primitive  from  the  runtime  library.  We  expect  that  if  a  user  has  declared  distinct  templates  then 
they  are  either  distributed  over  different  processor  subspaces,  or  have  a  different  number  of  distributed  dimen¬ 
sions.  Therefore,  there  is  no  regularity  in  the  communication  associated  with  a  fonll  statement  containing 
references  to  such  arrays. 

It  is  possible  that  a  programmer  may  create  more  than  one  template  with  the  same  number  of  distributed 
dimensions,  distributed  over  the  same  processor  subspace.  We  can  extend  our  analysis  to  consider  the  processor 
subspace  over  which  distinct  templates  are  distributed  in  determining  any  regularity  in  the  communication 
required.  However,  we  do  not  discuss  this  possibility  here. 

4.2.2  Caae  11 

The  data  access  patterns  associated  with  this  case  may  be  completely  regular,  or  may  require  the  overlap  cells 
to  filled  in,  or  may  require  a  regular  section  move. 

Let  DD(A)  denote  the  set  of  dimensions  of  the  array  A  which  are  distributed.  Under  the  assumptions  for 
Case  II,  DD(B)  =  DD(A).  In  terms  of  the  form  of  the  /ore// statement  and  the  array  subscripts  that  presented 
in  Section  4.2,  the  condition  for  the  communication  associated  with  the  forall  to  require  a  regular  section  move 
is  : 

3j  €  DD(A)  8.t. 

1.  il,  #  i2j  ,  or, 

2.  elj  ^  c2j  ,  or, 

3.  dlj  ^  d2j  and  either  dlj  or  d2j  is  not  a  compile-time  constant,  or, 

4.  tlj  =  i2j  =  4  *»td  dlj  ^  dlj. 

The  first  condition  states  that  there  is  loop  index  permutation.  In  that  case,  a  regular  section  move  will  be 
required.  The  second  condition  states  that,  along  the  dimension,  the  elements  of  the  arrays  A  and  B  2ue 
being  accessed  with  different  strides.  Again,  this  case  will  require  a  regular  section  move.  The  third  condition 
corresponds  to  the  fact  that  there  are  non-constant  of&ets.  If  there  are  constant  offsets,  then  only  the  overlap 

celk  need  to  be  filled  in.  For  overlap  cells,  space  needs  to  be  allocated  at  compile-time,  so  the  number  of  overlap 

cells  must  be  known  at  the  compile-time.  If  the  offisets  are  not  compile-  time  constants,  then  we  use  a  regular 
section  move  to  handle  communication.  This  situation  can  also  be  handled  by  shifts  into  a  temporary  array  [4]. 
The  fourth  condition  says  that  along  dimension  j  a  loop  variable  does  not  appear  in  either  the  left  hand  side  or 
the  right  hand  side  index  and  the  loop  invariant  scalars  are  different.  This  represents  a  copy  from  one  location 
to  another,  but  because  of  the  loop  variables  associated  with  other  dimensions,  will  typically  require  a  reguliu 
section  of  data  to  be  moved.  Since  the  distributed  array  descriptor  is  not  available  at  compile-time,  it  cannot 
be  determined  at  compile-time  whether  this  data  move  will  require  any  interprocessor  communication.  So  we 
handle  this  kind  of  data  move  using  the  regular  section  move  primitives  we  have  already  discussed. 

The  data  access  pattern  requires  filling  in  overlap  cells,  if  the  following  condition  holds: 
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Arrays  A  and  B  are  aligned  identically 


L.H.S. 

Expression 

R.H.S. 

Expression 

Regular  Section 
Move  Required 

Overlap  Cell 
Fill  Required 

A(ij) 

B(j+2,i+l) 

YES 

NO 

A(ij) 

B(2*ij) 

YES 

NO 

A(ij) 

B(i+nl  j+2) 

YES 

NO 

A(ij) 

B(i+lj+2) 

NO 

YES 

A(ij) 

B(iJ) 

NO 

NO 

Figure  3:  Analyzing  communication  for  Case  II 

1.  A  regular  section  move  is  not  required  and 

2.  3j  e  DD{A)  s.t.  dlj  9^  d2j. 

The  second  condition  states  that  there  is  a  difference  in  the  oihsets  along  some  (distributed)  dimension. 
Overlap  cells  must  he  filled  along  each  dimension  in  which  there  is  a  difference  in  the  ofbets.  In  Figure  3  we 
show  examples  for  the  different  possibilities  within  case  II,  for  identically  aligned  two  dimensional  arrays  A 
and  B. 


4.2.3  Case  lU 

In  this  case,  arrays  A  and  B  are  aligned  to  the  same  template  (T),  but  in  different  fashions.  We  consider 
only  the  cases  when  A  and  B  are  aligned  to  T  in  such  a  way  that  none  of  the  dimensions  of  either  A  or  B  is 
replicated.  In  this  case,  the  number  of  distributed  dimensions  of  A  and  B  would  be  identical,  and  will  be  equal 
to  the  number  of  distributed  dimensions  of  T.  Consider  any  distributed  dimension  j  of  A  (i.e.  j  €  DD{A)). 
We  use  AD{A,j)  to  denote  the  dimension  of  the  template  T  along  which  the  distributed  dimension  j  of  the 
array  A  is  aligned.  We  use  map(j)  to  denote  the  dimension  of  the  right  hand  side  array  B  which  is  aligned  to 
the  same  dimension  of  the  template  T  as  dimension  j  of  the  array  A.  Formally, 

map{j)  =  k  •<=>•  31  s.t.  AD{A,j)  =  1  A  AD{B,k)  =  I  . 

Since  each  of  the  dimensions  of  A  and  B  are  distributed  along  exactly  one  dimension  of  the  template  T  (as 
required  by  HPF),  map{j)  is  defined  and  is  unique  for  each  distributed  dimension  of  A. 

For  the  purpose  of  the  discussion,  we  assume  that  the  arrays  A  and  B  are  aligned  as  follows: 

!HPF$  ALIGN  A  {ku  ■  ■  ■  ,kj, . .  .k,)  WITH  T{hli,...,hlj'  :ExtJowlj  :ExtJiighlj,...,hlp) 

!HPF$  ALIGN  B  (*1  WITH  T(A2i , . . . , h2j>.  :  ExtJow2j  :  ExtJiigh2j ,...,h2p) 

where; 

p  =  \DD{A)\ 
r  =  AD{A,j) 


I.** 


i"  =  AD(BJ) 

hlji  =  alj*kj  +  blj 
h2ju  —  a2f  *  kj  +  b2j 

In  the  above,  dimension  j  of  the  array  A  is  aligned  with  dimension  f  of  the  template  T.  Similarly,  dimension 
j  of  the  array  B  is  aligned  with  dimension  j"  of  the  template  T.  ExtJowlj  and  ExtJiighlj  are,  respectively, 
the  number  of  external  ghost  cells  at  the  beginning  and  end  of  dimension  j  of  the  Array  A.  Similarly,  ExtJow2j 
and  ExtJiigh2j  are  the  number  of  external  ghost  ceUs  at  the  be^nning  and  end,  respectively,  of  dimension  j 
of  Array  B. 

If  we  view  the  computation  as  accessing  elements  of  the  template  T,  then  the  effective  oibet  for  the  left 
hand  side  array  reference  along  dimension  j  of  the  array  A  is  dlj  *al  ^  +blj  —  Ext  Jowl  j .  Similarly,  the  effective 
ofGset  for  the  right  hand  side  array  reference  along  dimension  j  of  the  array  B  is  d2j  *  a2j  +  b2j  —  ExtJow2j . 
The  data  access  pattern  iq  the  forall  loop  will  require  a  regular  section  move  if  the  following  condition  holds: 
3j  €  DD(A)  s.t. 

1.  ilj  5^  *^map(j)  1  OT 

2.  clj  ♦  alj  ^  *  ^^tnap(j)t 

3.  dlj  ♦  alj  +  ~  ExtJowlj  ^  ^map(j)  *  ^^mapQ)  "f"  k2tnap(j)  —  ExtJow2map(j) 

and  either  of  these  effective  offsets  is  not  a  compile-time  constant,  or 

4.  il^  =  ^,i2j  =  and  blj  -  ExtJowlj  ^  b2map{j)-BxtJow2map{j)- 

As  in  Case  11,  the  first  condition  states  that  there  is  loop  index  permutation.  The  second  condition  implies 
that  there  is  a  difference  in  the  effective  stride  taken  along  any  dimension.  The  third  conditions  says  that 
the  offsets  may  be  distinct  and  one  of  them  is  not  a  compile- time  constant.  The  fourth  condition  says  that 
a  rectilinear  section  of  data  needs  to  be  moved  along  a  certmn  dimension.  Suppose  that  tl;  =  in  and 
i2map(j)  =  U3-  If  we  view  the  loop  iterations  as  accessing  elements  of  the  template  T,  cl^  *  stn  *  alj  is  the 
effective  stride  of  the  subscript  on  the  left  along  dimension  AD{A,j)  ruid  similarly  c2map(j)  *  *  o^map{j)  is 

the  effective  stride  of  the  subscript  on  the  right  hand  side  along  dimension  AD{B,Tnap{j)).  By  the  definition 
of  map{j),  AD{A,j)  =  AD{B,map{j)).  If  ilj  and  iimap(j)  are  identical,  then  ibl  =  *2.  So,  the  required 
condition  for  the  effective  stride  for  left  and  right  side  to  be  identical  is  clj  *  alj  =  c2map(j)  *  >^2map(j)- 

The  data  ar'  <as  pattern  will  require  filling  in  overlap  cells  if  the  following  condition  holds: 

1.  A  regular  section  move  is  not  required  and 

2.  3j  €  DD{A)  s.t. 

dlj  ♦  alj  -H  6lj  —  ExtJowlj  ^  d2fnap(J)  *  ®2,nap(j)  "b  ^^map(j)  ~  BxtJow2map{j) 

The  first  condition  says  that  we  have  not  already  decided  that  a  regular  section  move  was  required.  The 
second  condition  says  that  the  effective  offsets  are  not  identical.  In  Figure  4,  we  give  several  examples  showing 
the  results  of  the  analysis  for  Case  Ill. 

4.3  Generating  calls  to  the  runtime  library 

Once  the  nature  of  the  comiitnAication  required  has  been  identified,  the  compiler  must  insert  the  appropriate 
calls  to  the  runtime  primitives.  We  first  discuss  how  the  calls  ve  made  to  the  routines  for  filling  in  ghost  cells. 
We  discuss  this  in  the  context  of  Case  III  from  the  previous  section,  since  Case  II  is  really  a  special  case  of 
Case  111. 
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ALIGN  A(ij)  WITH  T(i  j) 


ALIGN  B(ij)  WITH  T(2*j ,1+3:2: 1) 


L.H.S. 

Expression 

R.H.S. 

Expression 

A(ij) 

B(iJ) 

YES 

NO 

A(ij) 

B(j,i) 

YES 

NO 

A(2*iJ) 

B(j,i) 

NO 

YES 

A(2*ij) 

B0‘,i+3) 

NO 

YES 

A(2*ij) 

Ba.i+1) 

NO 

NO 

Figure  4:  Analysing  communication  for  Case  III 

We  identify  each  distributed  dimension  j  of  the  array  A  for  which 

dlj  *  aij  +  61^‘  —  Ext — lowlj  ^  d2fnap(J)  *  o2fnap(j)  +  ^^map(j)  ~  ExtJow2fnap(j) ■ 

One  call  to  the  schedule  building  primitive  Overlap.CeUJ'UlSched  and  one  to  the  data  moving  primitive 
Data.Move  is  inserted  for  each  such  dimension.  Since  all  computations  are  distributed  using  the  owner 
computes  rule,  overlap  cells  are  filled  in  for  the  right  hand  side  array  B.  For  the  Overiap.CellJ'UUched 
call,  the  dimension  of  the  move  is  map(j)  and  the  number  of  overlap  cells  to  be  filled  in  is  d2mapU )  *  o^map^j )  + 
^imapU)  ~  “  dlj  *  ol;  +  6Ij  -  Extjowlj. 

The  schedule  building  primitive  is  called  with  the  Distributed  Array  Descriptor  (DAD)  of  the  array  as  a 
parameter.  The  actual  array  storage  location  need  not  be  specified.  A  call  to  Data.Move  is  then  made  which 
uses  the  previously  built  schedule  to  copy  the  data. 

We  now  discuss  how  calls  to  the  primitive  for  moving  regular  sections  are  inserted.  If  there  is  more  than 
one  array  on  the  right  hand  side,  then  the  analysis  described  in  Section  4.2  is  done  for  each  such  array.  For 
each  of  the  right  hand  side  arrays  which  requires  a  regular  section  move,  a  temporary  array  is  declared  and 
a  regular  section  move  is  done  from  the  right  hand  side  array  into  the  temporary  uray.  If  there  is  only  one 
array  on  the  right  side  (i.e.  the  forall  loop  represents  only  a  copy  and  does  not  have  any  computation),  then 
the  regular  section  move  is  performed  directly  from  the  right  band  side  array  to  the  left  hand  side  array. 

The  parameters  of  Regularjsection.move.sched  are  assigned  as  follows.  In  the  forall  loop,  ij  is  the  j'* 
loop  variable.  The  total  number  of  loop  variables  is  m.  For  each  of  the  loop  variables,  we  identify  the 
dimensions  corresponding  to  the  subscripts  of  A  and  B  where  they  appear.  Srcdim(j)  and  Destdim(j)  denote 
the  dimensions  of  the  source  array  B  and  the  destination  array  A  (or  a  temporary  array)  that  correspond 
to  the  dimension  of  the  regular  section  being  moved.  Note  that  Srcdim(j)  is  not  necessarily  j  since  the 
forall  loop  allows  arbitrary  permutation  among  dimensions.  If  I'Ui  =  ij  and  t2jb3  =  ij  ,  then  SrcDim(j)  is 
assigned  kl  and  DestDim(j)  is  assigned  ib2.  The  remaining  elements  of  Srcdims  and  Destdims  are  assigned 
the  remaining  dimensions  of  A  and  B  whose  subscripts  are  loop  invariant  scalars  (the  exact  ordering  is  not 
important). 

SrcLos(k)  and  SrcHis(k)  denote  the  start  and  end  points,  respectively,  along  each  dimension  of  the  source. 
If  i2t  =  0,  then,  SrcLos(k)  =  SrcHis(k)  =  d2t.  Otherwise,  if  i2k  =  ij,  for  some  j,  then,  SrcLos(k)  = 
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C  ORIGINAL  F90D  CODE 

C  Arrays  A,  B  and  C  are  distributed  identically 

FORALL  (i  =  l.lOOj  =  1:100)  A(ij)  =  B(i+1  j)  +  C(iJ) 

C  TRANSFORMED  CODE 

Dim  =  1 

No.0f-O!ls  =  1 

sched  =  Over]ap.Cel]JF'il]JSched(DAD,Dim,NoJOf-Cells) 
C  DAD  is  distributed  array  descriptor  for  A,  B  and  C 
C  i  is  dimension  1,  j  is  dimension  2 

Call  DataJiiove(B,sclied,B) 

LI  =  Local-Lower-Bound(DAD,l) 

L2  =  Local-Lower.Bound(DAD,2) 

HI  =  Local.Upper_Bound(DAD,l) 

H2  =  Local.Upper.Bound(DAD,2) 

do  10  i  =  LI, HI 
do  lOj  =  L2,  H2 

10  A(ij)  =  B(i+1  j)  +  C(iJ) 


Figure  5;  Overlap  cell  fUl  and  loop  bounds  adjustment  example 

c2t  *  (loj)  +  d2t  and  similarly,  SrcHis(k)  =  e2k  *  (hij)  +  d2t.  For  the  destination,  the  low  and  high  indexes 
are  computed  in  the  same  manner. 

SrcStr(k)  denotes  the  stride  of  the  move  along  each  dimension  of  the  source.  If  i2k  =  then,  SrcStr(k) 
doesn’t  matter.  Otherwise,  if  t2t  =  ij ,  for  some  j,  then,  SrcStr(k)  =  c2t  *  atj .  For  the  destination,  the  strides 
are  computed  in  the  same  manner. 

4.4  Distributing  loop  iterations 

Once  the  calls  have  been  inserted  for  communicating  the  required  array  elements,  the  loop  iterations  must  be 
distributed  among  the  processors.  As  we  stated  earlier,  this  is  done  using  the  owner  computes  rule.  Since  the 
distributed  array  descriptors  are  built  at  runtime,  it  is  not  possible  to  compute  the  locid  loop  bounds  on  each 
processor  at  compile-time. 

Consider  any  loop  variable  ij ,  Let  i2ki  =  ij-  The  loop  accesses  elements  ranging  from  cUi  *  loj  +  din  to 
cln  *  hij  +  dlti.  We  partition  the  loop  based  upon  the  portion  of  the  distributed  arrays  that  are  owned  by  a 
given  processor.  This  is  done  by  inserting  runtime  calls  to  the  the  library  primitives  LocalJtOwerJound  and 
LocaLUpperJound.  Note  that  for  arrays  which  are  not  in  canonical  form  (i.e.  where  elj  ^  1  or  dlj  /  0  for 
some  i  ),  we  can  still  partition  the  loop  based  upon  the  owners  compute  rule.  Consequently,  we  never  need 
to  scatter  any  data  after  the  loop  has  been  executed. 

In  Figure  5  we  show  an  example  of  how  the  calls  to  primitives  for  filling  in  overlap  cells  are  inserted  by  the 
compiler.  In  Figure  6,  we  show  how  the  compiler  inserts  calls  to  the  primitives  for  moving  regular  sections. 
In  both  examples,  the  transformed  code  containing  the  calls  to  the  runtime  library  will  run  as  SPMD  code 
on  each  processor  of  the  distributed  memory  parallel  machine.  Note  that  in  the  compiler  generated  code, 
schedule  building  primitives  will  be  called  every  time  any  forall  loop  requiring  communication  is  executed. 
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C  ORIGINAL  FOOD  CODE 

C  Arrays  A,  B  are  distributed  identically 

foraU  (i  =  l:100;2j  =  1:50)  A(iJ)  =  B(2*j,i) 

C  TRANSFORMED  CODE 

NamSrcDim=  2  NumDestDim  =  2 

SrcDiin(l)  =  2  DestDiin(l)  =  1 

SrcDiin(2)  =  1  DestDim(2)  =  2 

SrcLos(l)  =  2  De8tLos(l)  =  1 

SrcLo6(2)  =  1  De8tLos(2)  s  1 

SrcHis(l)  s  100  De8tHis(l)  =  100 

SrcHis(2)  s  100  DeatHis(2)  =  50 

SrcStr(l)  s  2  DestStr(l)  s  2 

SrcStr(2)  s  2  DestStr(2)  =  1 

Sched  s  Regular.Section-Move^ched(DAD, DAD, NumSrcDim, NumDestDim, 
SrcDim,  SrcLos,  SrcHis,  SrcStr, 

DestDim,  DestLos,  DestHis,  DestStr) 

Call  Data3love(B,Sched,A) 


Figure  0:  Regular  section  move  example 
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The  hand  coded  version  can  build  a  schedule  once  and  reuse  it  in  subsequent  iterations.  Similarly,  in  the 
compiler  generated  code,  runtime  calls  to  the  loop  bound  adjustment  primitives  will  be  made  each  time  a  loop 
is  executed.  The  hand  coded  version  can  reuse  the  adjusted  bounds  over  the  multiple  time  steps,  and  also 
for  multiple  loops  that  have  the  same  loop  bounds.  These  two  factors  may  cause  compiler  generated  code  to 
perform  worse  than  hand  parallelized  code. 

5  Experimental  Results 

In  this  section  we  present  experimental  results  to  demonstrate  the  efficacy  of  our  approach.  We  are  interested 
in  two  different  factors,  performance  of  the  library  primitives  and  the  effectiveness  of  the  compiler.  We  study 
the  first  factor  by  measuring  the  runtime  overhead  incurred  in  using  our  library  primitives  as  compared  the 
bare  cost  of  communication  associated  with  the  best  possible  hand  parallelized  codes.  We  study  the  latter 
factor  by  comparing  the  performance  of  compiler  parallelized  codes  with  the  codes  parallelized  by  manually 
inserting  calls  to  the  library  functions.  We  also  study  the  effect  of  data  distribution  on  the  performance  of 
these  codes. 

We  have  experimented  with  two  major  codes:  a  template  from  a  multiblock  Navier  Stokes’  solver  and  a 
semi-coarsening  multigrid  code.  We  have  parallelized  a  template  from  a  multiblock  computation  fluid  dynamics 
application  that  solves  the  thin-layer  Navier-Stokes  equations  over  a  3D  surface  (multiblock  TLNS3D),  using 
our  prototype  Fortran  90D  compiler.  The  multiblock  TLNS3D  code  we  are  working  with  was  developed  by 
Vatsa  et  al.  [40]  at  NASA  Langley  Research  Center,  and  consists  of  nearly  18,000  lines  of  Fortran  77  rode. 
The  template,  which  was  designed  to  include  portions  of  the  entire  code  that  are  representative  of  the  major 
computation  and  communication  patterns  of  the  original  code,  consists  of  nearly  2,000  lines  of  F77  code.  We 
have  also  worked  with  a  semi-coarsening  multigrid  code  [35].  This  has  nearly  2,500  lines  of  F77  code.  In  all 
the  experiments  described  in  this  section,  performance  data  is  presented  starting  from  the  minimum  number 
of  processors  which  provided  sufficient  memory  for  executing  the  program  up  to  32  processors. 

5.1  Overhead  of  Primitives 

We  are  interested  in  evaluating  the  performance  of  a  code  petrallelized  using  our  primitives  as  compared  the 
performance  of  the  “best  hand-parallelized”  code.  By  hand  parallelized  code,  here  we  mean  parallelized  by 
inserting  calls  to  the  communication  routines  provided  by  the  distributed  memory  machine  and  not  using  our 
library  routines.  Note  that,  to  the  best  of  our  knowledge,  no  complete  block  structured  code  has  yet  been  hand- 
parallelized  on  a  distributed  memory  parallel  machines.  The  reason  is  that,  for  an  application  programmer 
parallelizing  such  a  code  with  hand,  it  is  very  difficult  to  analyze  the  exact  communication  required  in  these 
codes  and  then  be  able  to  use  the  communication  routines  available  on  the  parallel  machine  to  handle  it 
efficiently.  The  use  of  library  primitives  involves  runtime  overheads  because  of  generating  schedules,  overhead 
of  copying  data  to  be  communicated  into  buffer  at  source  processors,  and  similarly  the  overhead  of  copying 
the  received  data  into  appropriate  memory  locations  at  the  destination  processors.  The  possible  advantage  (in 
terms  of  efficiency)  of  the  library  primitives  is  that,  for  each  invocation  of  a  data-move,  each  processor  sends  at 
most  one  message  to  each  other  processor.  It  may,  in  general,  be  very  difficult  for  an  application  programmer, 
parallelizing  the  code  by  hand,  to  do  such  message  aggregation.  However,  the  best  performance  that  an 
application  programmer  can  ever  achieve  will  only  have  the  cost  of  aictual  communication  and  computation, 
assuming  that  messages  have  been  aggregated  to  reduce  the  effect  of  communication  latencies.  We  will 
study  the  overheads  incurred  in  a  code  parallelized  using  our  primitives  as  compared  to  the  best  possible 
performance  of  hand  parallelized  code,  which  incurs  the  minimum  communication  costs  (assuming  maximum 
message  aggregation). 
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r  r  r  T  I  I 


Set  I  :  Communication  (Max.  message  aggregation) 

Set  II  :  Communication  iind  copying 

Set  III :  Communication,  copying  and  schedule  building 


Figure  7:  Performance  of  Primitives  on  iPSC/860  (100  iterations) 
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ONE  BLOCK:  49  X  9  X  9  Mesh  (50  Iterations) 


Number  of 

Processors 

Compiler 

Parallelized 

Hand 

Parallelized  F90 

Hand 

Parallelized  F77 

4 

6.99 

6.88 

6.20 

8 

4.17 

4.06 

4.00 

16 

2.47 

2.35 

2.28 

32 

1.55 

1.45 

1.41 

Figure  8:  Performance  comparison  for  small  mesh,  one  block  (in  seconds)  on  iPSC7860 

We  consider  a  simple  code  executed  on  2  processors  in  which  a  regular  section  move  involves  moving  data 
from  processor  0  to  processor  1.  We  vary  the  number  of  bytes  involved  in  the  regular  section  moves  and 
measure  three  sets  of  timings:  the  time  required  just  for  communication,  the  time  required  for  communication 
when  library  primitives  are  used  (excluding  the-  cost  of  schedule  building)  and  the  total  time  required  when  the 
library  primitives  are  used  (including  time  for  schedule  building).  The  first  set  of  timings  represents  the  best 
performance  that  hand  parallelization  can  achieve  if  ail  the  data  elements  to  be  communicated  are  laid  out 
contiguously.  If  the  data  elements  to  be  communicated  are  not  contiguous,  then  the  application  programmer 
will  need  to  do  copying  to  aggregate  message.  The  second  set  of  timings  represents  this  case.  The  third  set 
of  timings  represent  the  performance  with  the  use  of  library  primitives.  The  timings  presented  are  for  100 
iterations  of  the  regular  section  move. 

The  performance  results  on  an  iPSC/860  are  presented  in  Figure  7.  The  results  show  that  the  cost 
of  copying  (difference  between  Set  I  and  Set  II)  is  typically  a  small  fraction  (less  than  5%)  of  the  cost  of 
communication  for  most  of  the  cases.  Also,  if  the  schedule  built  is  used  over  a  large  number  of  iterations, 
the  cost  of  building  the  schedule  is  also  a  small  fraction  of  the  cost  of  communication.  We  have  performed 
similar  experiments  on  Thinking  Machines’  CM-5.  Our  experiments  have  shown  that  the  runtime  overheads 
associated  with  the  use  of  primitives  are  again  a  small  fraction  of  the  bare  cost  of  communication. 

5.2  Multiblock  Code 

We  have  parallelized  a  multiblock  template  using  our  compiler.  We  hand  parallelized  this  template  by  manually 
inserting  calls  to  the  multiblock  Parti  routines.  (Note  that  this  is  different  from  the  hand  parallelization  we 
talked  about  in  the  previous  subsection.)  We  converted  the  F77  (sequential)  code  to  F90D  manually,  by 
rewriting  the  the  major  computational  parts  of  the  code  using  forall  loops  and  F90  array  expressions,  also 
adding  the  required  data  distribution  directives.  We  then  parallelized  the  code  by  running  it  through  the 
F90D  compiler.  We  also  created  a  hand  parallelized  F90  version  of  the  template  in  which  all  computations 
are  done  with  single  statement  forall  loops,  but  the  calls  to  the  runtime  primitives  are  inserted  manually. 

We  now  compare  the  relative  performances  of  compiler  parallelized  F90  code,  hand  parallelized  F90  code 
and  hand  parallelized  F77  code,  varying  the  mesh  size  and  number  of  blocks  for  the  application,  and  also 
varying  the  number  of  processors  used  on  an  Intel  iPSC/860.  we  used  the  minimum  number  of  processors  In 
Figure  8,  we  present  the  performance  results  on  a  49  x  9  x  9  mesh  (with  one  block),  comparing  the  performance 
of  the  three  versions  from  4  to  32  processors.  In  Figure  9,  we  present  the  performance  results  on  a  49  x  17x9 
mesh  (split  into  two  blocks),  comparing  the  performance  of  the  three  versions  from  8  to  32  processors.  The 
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TWO  BLOCKS:  49  X  9  X  9  Each  (50  Iterationa) 


Number  of 

Processors 

Compiler 

Parallelized 

Hand 

Paridlelized  F90 

Hand 

Parallelized  F77 

8 

7.49 

6.69 

6.17 

16 

4.64 

4.07 

4.03 

32 

2.88 

2.32 

2.30 

Figure  9:  Performance  comparison  for  larger  mesh,  two  blocks  (in  seconds)  on  iPSC/860 

template  is  communication  intensive  and  therefore  the  absolute  speedups  are  not  very  high  in  either  of  the 
versions.  The  compiler  parallelized  F90  code  performs  within  around  20%  of  the  hand  parallelized  F77  code. 
The  hand  parallelized  F90  code  performs  worse  than  the  hand  parallelized  F77  code.  This  is  because,  in  the 
F90  version,  all  computation  is  done  through  single  statement  forall  loops  that  result  in  the  creation  of  (large) 
temporary  arrays.  Such  use  of  temporary  storage,  and  the  fact  that  no  loop  fusion  between  parallel  loops 
is  done  by  the  compiler,  increases  the  number  of  cache  misses  on  each  processor.  However,  the  difference 
in  performance  between  the  F90  and  F77  hand  parallelized  versions  decreases  as  the  number  of  processors 
increases.  This  is  because  as  the  number  of  processor  increases,  less  memory  is  required  on  each  processor,  so 
the  effect  on  cache  utilization  is  less  significant.  The  difference  in  performance  of  the  hand  parallelized  F90 
and  the  compiler  parallelized  code  comes  from  two  major  factors.  First,  in  the  compiler  generated  version,  the 
runtime  calls  for  computing  new  loop  bounds  are  made  in  each  loop  iteration,  as  compared  to  only  once  for 
the  hand  parallelized  version.  Second,  as  the  template  is  run  over  a  large  number  of  time  steps,  the  compiler 
generated  version  makes  repeated  calls  to  the  runtime  library  to  build  communication  schedules,  whereas  in 
the  hand  parallelized  version  the  calls  are  lifted  out  of  the  time  step  loop.  To  reduce  the  additional  cost 
due  to  this  second  factor,  our  runtime  library  library  saves  schedules.  When  a  call  is  made  for  generating 
the  schedule,  the  library  searches  a  hajh  table  to  check  if  any  schedule  with  exactly  the  same  parameters  is 
present.  If  so,  the  saved  schedule  is  returned.  This  technique  still  has  this  overhead  as  compared  to  a  hand 
parallelized  version.  To  study  the  exact  costs  of  each  of  these  factors,  we  present  a  more  detailed  experiment 
in  Section  5.4. 


5.3  Multigrid  Code 

We  have  also  experimented  with  a  semi-coarsening  multigrid  code  developed  by  Rosendale  and  Overman  [35]. 
This  has  nearly  2,500  lines  of  F77  code.  We  discussed  the  semi-coarsening  multigrid  technique  and  the  mapping 
policy  used  in  parallelizing  such  an  application  earlier  in  Section  2. 

We  rewrote  this  code  using  forall  loops  and  including  the  distribution  directives  and  then  parallelized 
it  using  our  compiler.  This  code  had  also  been  parallelized  by  inserting  the  calls  to  the  library  routines 
manually  [35].  In  Figure  10,  we  show  the  performance  comparison  of  these  two  parallel  versions  run  on  Intel 
iPSC/860.  The  results  are  for  a  32x32x32  grid,  using  a  coarsening  factor  of  4  along  each  dimension.  The  code 
uses  8  different  grids  at  four  different  levels.  We  did  not  create  a  separate  hand  parallelized  F90  version  since 
most  of  the  subroutines  in  this  code  are  fairly  small  and  therefore  rewriting  it  in  F90  using  forall  loops  did  not 
involve  introducing  large  temporary  arrays.  Consequently,  we  did  not  expect  to  see  any  notable  difference  in 
the  performance  of  hand  parallelized  F90  and  F77  versions. 

The  results  of  the  performance  of  compiler  parallelized  and  band  parallelized  multigrid  code  are  presented 
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No.  of 

Proc. 

Compiler: 

1**  iteration 

Compiler: 

Per  Subsequent 
Iteration 

By  hand: 

1**  iteration 

By  hand: 

Per  Subsequent 

Iteration 

8 

4.80 

2.29 

4.60 

2.14 

16 

3.84 

1.38 

3.41 

1.35 

32 

3.03 

.95 

2.48 

.88 

Figure  10:  Semi-Coarsening  muitigrid  performance  (in  seconds)  on  iPSC/860 


TWO  BLOCKS  :  25  X  9  X  9  Each  (50  Iterations) 


No.  of 

Proc. 

Compiler 
Version  I 

Compiler 
Version  11 

Compiler 
Version  III 

Compiler 
Version  IV 

Hand 

F90 

4 

13.45 

7.63 

7.41 

7.33 

6.79 

8 

15.51 

4.78 

4.58 

4.54 

4.19 

16 

11.72 

2.85 

2.71 

2.62 

2.39 

32 

8.01 

1.85 

1.79 

1.66 

1.47 

Version  I 
Version  II 
Version  III 
Version  IV 


:  Runtime  Library  does  not  save  schedules 
:  Runtime  Library  saves  schedules 
:  Schedules  reuse  implemented  by  hand 
:  Loop  bounds  reused  within  a  procedure 


Figure  11:  Effects  of  various  optimizations  (in  seconds)  on  iPSC/860 

in  Figure  10.  The  results  show  that  the  compiler  parallelized  code  performs  within  10%  of  the  hand  parallelized 
code  in  this  case.  Again,  as  this  code  was  very  communication  intensive,  the  absolute  speedup  is  not  very  high 
for  either  version. 

5.4  Compiler  Optimizations 

In  Figure  1 1 ,  we  study  the  effect  of  the  compiler  optimizations.  Version  I  is  a  compiler  parallelized  version 
in  which  the  library  does  not  save  any  schedules.  This  version  performs  badly  because  of  the  high  cost  of 
rebuilding  the  schedules  for  every  iteration.  Version  II  is  the  compiler  parallelized  version  in  which  the  library 
saves  schedules.  This  results  in  a  major  gain  in  performance.  We  now  discuss  some  optimizations  which 
are  not  implemented  in  the  current  compiler.  We  studied  the  effect  of  these  optimizations  by  modifying  the 
compiler  generated  code  by  hand.  Version  III  represents  the  case  where  the  compiler  performs  sophisticated 
interprocedural  analysis  to  reuse  the  schedules  during  successive  time  steps.  Version  III  performs  better  than 
version  II,  in  which  the  schedules  are  reused  within  the  library,  but  the  difference  is  not  large. 

In  the  compiler  parallelized  version,  runtime  calls  are  made  to  the  functions  for  adjusting  loop  bounds  for 
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TWO  Bl 

LOCK:  49  X  17  X  9  Mesh  (50  Iterations) 

Number  of 

Processors 

Blocks  Mapped 
Entire  Proc.  Space 

Blocks  Mapped 
Disjoint  Proc.  Spaces 

4 

8.99 

7.59 

8 

5.14 

4.74 

16 

3.24 

2.83 

32 

2.41 

1.87 

Figure  12'.  Effect  of  Data  Distribution  on  iPSC/860 


each  forall  loop  on  each  time  step.  The  hand  parallelized  version  can  store  the  loop  bounds  computed  during 
the  first  time  step,  for  subsequent  reuse.  Additionally,  a  procedure  may  contain  several  loops  involving  the 
same  array  on  the  left  hand  side  that  have  the  same  loop  bounds.  Our  compiler  generates  separate  runtime 
calls  for  adjusting  loop  bounds  for  each  such  loop.  Such  optimizations  will  be  implemented  in  a  future  version 
of  the  compiler. 

In  Figure  11,  the  difference  between  version  III  and  t!ie  hand  parallelized  F90  version  shows  the  extra 
cost  of  generating  loop  bounds  at  runtime  for  each  forall  loop  during  each  time  step.  The  results  show  that 
generating  loop  bounds  at  runtime  is  the  major  factor  in  the  performance  difference  between  the  compiler 
parallelized  version  and  the  hand  parallelized  versions.  In  version  IV,  we  show  the  results  of  an  unimplemented 
optimization  in  which  the  compiler  is  able  to  identify  the  loops  with  the  same  left  hand  side  array  and  same 
loop  bounds  within  a  subroutine.  Then  the  compiler  needs  to  generate  calls  to  the  loop  bound  adjustment 
functions  only  once  for  each  such  set  of  loops.  This  optimization  also  provides  an  improvement  over  version 
III. 

5.5  Effect  of  Data  Distributions 

As  we  discussed  earlier,  one  of  the  features  of  our  runtime  library  is  the  ability  to  map  arrays  (or  templates) 
to  subsets  of  the  processor  space.  In  the  current  definition  of  HPF  (and  hence  in  HPF  compilers),  this  is 
not  possible.  In  block  structured  codes,  this  feature  allows  us  to  keep  the  communication  overheads  low 
while  maintaining  the  load  balance.  To  study  the  benefit  of  this  feature,  we  experimented  with  the  multigrid 
template  described  above  for  two  block  case.  We  ran  the  parallelized  code,  once  distributing  both  the  blocks 
over  the  entire  processor  space  and  then  distributing  each  block  over  disjoint  processor  spaces.  The  results  on 
Intel  iPSC/860,  shown  in  the  Figure  12,  show  that  the  latter  scheme  improves  the  performance  by  nearly  10 
to  25%.  Since  there  is  no  difference  in  the  net  computation  performed  at  each  Processor  in  either  of  the  the 
two  cases,  this  difference  comes  because  of  the  increased  amount  of  communication  required  when  each  block 
is  distributed  across  the  entire  processor  space.  Mapping  a  block  over  a  large  number  of  processors  increases 
communication  arising  from  near  neighbor  interactions  during  the  regular  computation  within  blocks.  Note 
that  a  10  to  25%  degradation  in  performance  occurs  when  there  are  only  two  blocks.  We  expect  that  with  a 
larger  number  of  blocks,  the  difference  in  the  performance  would  be  much  more  severe. 
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6  Conclusions 


To  reliably  and  portably  program  distributed  memory  parallel  machines  ,  it  is  important  to  have  both  a 
machine  independent  language  and  runtime  support  for  optimizing  communication.  High  Performance  Fortran 
and  its  variants  have  emerged  as  the  most  likely  candidates  for  machine  independent  parallel  programming  on 
distributed  memory  machines.  One  class  of  scientific  and  engineering  applications  involves  structured  grids  or 
meshes.  These  meshes  may  be  nested  (as  in  multigrid  codes)  or  may  be  irregularly  coupled  (called  multiblock 
codes  or  Irregularly  Coupled  Regular  Meshes).  Multiblock  and  multigrid  codes  form  a  significant  part  of 
scientific  and  engineering  applications. 

In  this  paper  we  have  addressed  the  problem  of  runtime,  compiler  and  programming  language  support 
for  parallelizing  this  important  class  of  applications  on  distributed  memory  machines.  We  have  designed  and 
implemented  a  set  of  runtime  primitives  for  parallelizing  these  applications  in  an  efficient,  convenient  and 
machine  independent  manner.  The  runtime  primitives  give  ability  to  specify  data  distributions,  perform  com¬ 
munication  and  distribute  loops  based  on  data  distributions  specified  at' runtime.  One  of  the  communication 
primitives  in  our  library  is  the  regular  section  move,  which  can  copy  a  rectilinear  part  of  a  distributed  array 
onto  a  rectilinear  part  of  another  distributed  array,  potentially  involving  index  permutations,  change  of  strides 
and  change  in  offsets.  We  have  presented  runtime  analysis  which  can  implement  this  communication  primitive 
efficiently. 

For  making  the  task  of  application  programmers  easy,  it  is  important  to  have  compiler  support.  In  this 
paper,  we  have  presented  techniques  that  can  be  used  by  compilers  for  HPF-style  programming  languages  to 
automatically  generate  calls  to  the  runtime  primitives.  We  have  presented  the  method  by  which  the  compiler 
can  analyze  the  data  access  patterns  associated  with  parallel  loops  and  therefore  identify  communication 
patterns  which  can  be  efficiently  handled  using  the  communication  primitives  that  the  multiblock  Parti  library 
supports.  We  have  also  presented  compiler  transformations  that  the  compiler  performs  for  automatically 
generating  calls  to  the  runtime  primitives. 

We  have  implemented  the  compiler  analysis  method  in  the  Fortran  90D  compiler  being  developed  at 
Syracuse  University.  We  consider  this  work  to  be  a  part  of  an  integrated  effort  toward  developing  a  powerful 
runtime  support  system  for  a  F90D  compiler.  We  have  experimented  with  a  template  from  a  3D  multiblock 
Navier-Stokes  solver  and  a  multigrid  code. 

For  demonstrating  the  efficacy  of  our  approach,  we  examined  two  separate  factors:  performance  of  the 
runtime  primitives  and  performance  of  compiler  parallelized  code  as  compared  to  the  code  parallelized  by 
inserting  calls  to  the  runtime  primitives  by  hand.  We  examined  the  additional  cost  of  using  library  primitives 
as  compared  to  the  minimum  cost  of  communication.  Performance  results  show  that  the  additional  cost  in 
using  the  library  primitives  (schedule  building  and  data  copying)  is  a  small  fraction  of  the  minimum  cost 
of  communication.  One  of  the  features  of  our  library  which  is  not  supported  in  current  version  of  HPF 
(and  consequently  in  HPF  compilers)  is  the  ability  to  map  arrays  or  blocks  over  part  of  the  processor  space. 
We  presented  results  with  TLNS3D  template  to  show  the  improvement  in  performance  achieved  because 
of  this  feature.  We  compared  the  performance  of  compiler  parallelized  code  with  the  performance  of  hand 
parallelized  F90  and  F77  codes,  and  have  shown  that  the  compiler  parallelized  code  performs  within  20%  f 
hand  parallelized  F77  code.  The  optimization  of  having  the  runtime  library  save  and  reuse  communication 
schedules  allows  the  compiler  parallelized  code  to  perform  almost  as  well  as  hand  parallelized  code.  We  have 
also  experimented  with  other  optimizations.  The  optimization  of  reusing  computed  loop  bounds  within  a 
subroutine  improves  the  performance  of  the  compiler  parallelized  code  and  brings  it  within  10%  of  the  hand 
parallelized  version. 

To  the  best  of  our  knowledge,  we  are  not  aware  of  any  real  block  structured  codes  which  have  been  paral¬ 
lelized  on  any  distributed  memory  machine.  The  reason  is  that,  for  an  application  programmer  parallelizing 
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such  an  application  by  hand,  it  is  very  difficult  to  analyze  the  exact  communication  required  and  then  to  be 
able  to  use  the  communication  routines  provided  by  the  machine  to  communicate  efficiently.  Our  runtime 
and  compiler  support  can  be  used  to  parallelize  such  applications  conveniently.  Our  experimental  results  have 
shown  that  the  code  parallelized  by  using  the  compiler  will  have  only  a  small  overhead  as  compared  to  the 
best  hand  parallelized  code  (i.e.  parallelized  by  invoking  system’s  communication  primitives  by  hand). 

While  the  design  of  our  runtime  system  was  motivated  by  multiblock  and  multigrid  applications,  our 
runtime  primitives  can  be  used  in  many  cases  for  regular  codes  as  well.  We  therefore,  believe  that  our  runtime 
support  and  compiler  techniques  can  be  used  by  compilers  of  HPF-style  parallel  programming  languages  in 
general. 
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